Now that we have a clean dataset, let's dive in and look for some useful insights.
Exploratory Data Analysis - High School
# Import of libraries used in the script
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import plotly.express as px
import plotly.graph_objects as gp
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from utils import build_variables_dict, get_plotly_df, update_fig_layout
# Set display options to ensure all columns and rows are displayed when using functions like df.head()
pd.set_option('display.max_columns', None) # Set maximum number of columns to display to None (unlimited)
pd.set_option('display.max_rows', None) # Set maximum number of rows to display to None (unlimited)
pd.set_option('display.precision', 2) # Set precision for float numbers to 2 decimal places
pd.set_option('display.max_colwidth', None) # Set maximum column width to None (unlimited)
# Read high school clean dataset
file_path = '../../results/data/high_school_dataset_clean.csv'
# low_memory=False reads the file in one pass, so columns with mixed types get a single inferred dtype
df = pd.read_csv(file_path, sep=',', low_memory=False)
clean_data_df = df.copy()
clean_data_df['PERIODO'] = clean_data_df['PERIODO'].apply(lambda x: '{:.0f}'.format(x))
clean_data_df_ITESO = clean_data_df[clean_data_df['CCT_INS_PLA']=='14MMS0519C']
Let's understand ITESO's High School Profile
group_by_dict = {
'PERIODO':'year',
'NOMBRE_INS_PLA':'institution',
'C_MODALIDAD':'modality',
'C_OPCION_EDUCATIVA':'option',
'C_TURNO':'shift'
}
variables_dict = build_variables_dict(['V692'])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)
title='Number of high school students enrolled at ITESO per year, shift, modality and option'
fig = px.histogram(
plotly_df, x="year", y="students",
title=title,
color='shift',
text_auto='.s',
facet_col='modality',
facet_row='option',
barmode='group'
)
fig = update_fig_layout(fig, "high_school", title, height=800)
fig.show()
plotly_df
|  | year | institution | modality | option | shift | students | sex | age | type | grade |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | ITESO | MIXTA | MIXTA | NOCTURNO | 69 | all | Total | total | total |
| 1 | 2020 | ITESO | MIXTA | MIXTA | NOCTURNO | 0 | all | Total | total | total |
| 2 | 2021 | ITESO | ESCOLARIZADA | PRESENCIAL | MATUTINO | 192 | all | Total | total | total |
| 3 | 2022 | ITESO | ESCOLARIZADA | PRESENCIAL | MATUTINO | 394 | all | Total | total | total |
As we can see from both the table and the graph, there have been some changes in how ITESO operates.
In 2021, it changed:
- Modality: From a mixed modality to formal education ("Escolarizada")
- Educational Option: From mixed to formal on-site ("Presencial")
- Shift: From nighttime to daytime ("Matutino")
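The same comparison can be made programmatically. The following is a minimal sketch, not part of the original notebook, that diffs the 2019 and 2021 rows of the table above; `changed_fields` is a hypothetical helper name introduced only for illustration.

```python
# Rows copied from the table above (2019 and 2021 only).
row_2019 = {"modality": "MIXTA", "option": "MIXTA", "shift": "NOCTURNO"}
row_2021 = {"modality": "ESCOLARIZADA", "option": "PRESENCIAL", "shift": "MATUTINO"}


def changed_fields(before, after):
    """Return {field: (old, new)} for every field whose value differs."""
    return {
        key: (before[key], after[key])
        for key in before
        if before[key] != after[key]
    }


changes = changed_fields(row_2019, row_2021)
for field, (old, new) in changes.items():
    print(f"{field}: {old} -> {new}")
```

All three attributes differ between the two rows, which matches the list above.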
Leaving out 2020 because of COVID, enrollment has increased every year: it grew 178% from 2019 to 2021 and a further 105% from 2021 to 2022.
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
plotly_df_filtered = plotly_df[plotly_df['year'] != '2020'].copy()
plotly_df_filtered['Growth Rate'] = plotly_df_filtered['students'].pct_change() * 100
plotly_df_filtered
|  | year | institution | modality | option | shift | students | sex | age | type | grade | Growth Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019 | ITESO | MIXTA | MIXTA | NOCTURNO | 69 | all | Total | total | total | NaN |
| 2 | 2021 | ITESO | ESCOLARIZADA | PRESENCIAL | MATUTINO | 192 | all | Total | total | total | 178.26 |
| 3 | 2022 | ITESO | ESCOLARIZADA | PRESENCIAL | MATUTINO | 394 | all | Total | total | total | 105.21 |
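To make the `Growth Rate` column concrete, here is a minimal sketch of the same percent-change arithmetic in plain Python. It is an illustration rather than the notebook's code, and `growth_rates` is a hypothetical helper name. Each value is compared with the previous remaining row, so with 2020 dropped, 2021 is measured against 2019.

```python
# Enrollment figures taken from the table above, with 2020 removed.
students = {"2019": 69, "2021": 192, "2022": 394}


def growth_rates(series):
    """Percent change of each value relative to the previous one."""
    years = list(series)
    rates = {}
    for prev, curr in zip(years, years[1:]):
        rates[curr] = (series[curr] - series[prev]) / series[prev] * 100
    return rates


rates = growth_rates(students)
for year, rate in rates.items():
    print(f"{year}: {rate:.2f}%")
```

Note that the 2021 figure spans a two-year gap, while the 2022 figure covers a single year, so the two rates are not directly comparable.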
Taking into account the changes made in 2021, I'll proceed to estimate the number of students for 2023, 2024 and 2025 using a linear regression on the data from 2021 and 2022.
X = plotly_df_filtered[1:3][['year']]
y = plotly_df_filtered[1:3]['students']
model = LinearRegression()
model.fit(X, y)
LinearRegression()
# Predict with a DataFrame that carries the same feature name used in fit()
future_years = pd.DataFrame({'year': [2023, 2024, 2025]})
predicted_values = model.predict(future_years)
predicted_values
array([ 596., 798., 1000.])
data = {
'year': [*plotly_df['year'].tolist(), '2023 *', '2024 *', '2025 *'],
'students': [*plotly_df['students'].tolist(),
int(predicted_values[0]),
int(predicted_values[1]),
int(predicted_values[2])]
}
df_plot = pd.DataFrame(data)
title='Number of high school students enrolled at ITESO per year'
fig = px.bar(
df_plot, x='year', y='students',
title=title,
category_orders={'year': df_plot['year'].tolist()},  # key must match the column name; keeps chronological order
text_auto='.s',
)
fig.update_layout(
xaxis_title='Year',
yaxis_title='Number of Students',
annotations=[
dict(
x=0.7, y=1,
xref='paper', yref='paper',
text="* estimated data",
showarrow=False, textangle=0,
)
]
)
fig = update_fig_layout(fig, "high_school", title)
fig.show()
# Work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
plotly_df_filtered = df_plot[df_plot['year'] != '2020'].copy()
plotly_df_filtered['Growth Rate'] = plotly_df_filtered['students'].pct_change() * 100
plotly_df_filtered
|  | year | students | Growth Rate |
|---|---|---|---|
| 0 | 2019 | 69 | NaN |
| 2 | 2021 | 192 | 178.26 |
| 3 | 2022 | 394 | 105.21 |
| 4 | 2023 * | 596 | 51.27 |
| 5 | 2024 * | 798 | 33.89 |
| 6 | 2025 * | 1000 | 25.31 |
As I mentioned before, ITESO experienced a 105% growth rate in 2022. Using a linear regression model, I estimated the values for the next three years and also projected the growth rate. Based on the available data, I estimate that ITESO will have 1,000 enrolled students by 2025.
Keep in mind that this estimate is based on very limited data (only the two years after the 2021 changes) and doesn't factor in other variables such as the potential student population, ITESO's infrastructure, and costs.
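A short sketch may help show why this projection should be read cautiously. With only two observations, a least-squares line passes exactly through both points, so the forecast simply repeats the 2021 to 2022 increase every year. The code below is an illustration in plain Python, not the notebook's scikit-learn model; `predict` is a hypothetical helper.

```python
# The two observations used to fit the model (from the table above).
x1, y1 = 2021, 192
x2, y2 = 2022, 394

# With two points, least squares reduces to the line through both of them.
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1


def predict(year):
    """Evaluate the fitted line at a given year."""
    return slope * year + intercept


forecast = {year: predict(year) for year in (2023, 2024, 2025)}
print(slope)
print(forecast)
```

The slope is 202 students per year, which reproduces the 596, 798 and 1,000 estimates. Because the absolute increase is constant, the projected growth rate necessarily declines each year, and a fit through two points leaves no residuals from which to judge uncertainty.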
Now let's take a look at the student distribution.
# Aggregating data of enrolled students
group_by_dict = {
'PERIODO':'year',
'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict([f"V{i}" for i in range(402, 692 + 1)])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)
plotly_df_filtered = plotly_df
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['institution']=='ITESO']
# plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['year']=='2019']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['type']!='total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['age']!='Total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['grade']=='total']
# Aggregating data of enrolled students 2019
plotly_df_filtered_2019 = plotly_df_filtered[plotly_df_filtered['year']=='2019']
grouped_df = plotly_df_filtered_2019.groupby(['age', 'sex'])['students'].sum().unstack(fill_value=0)
grouped_df = grouped_df.reset_index()
y_age = grouped_df['age']
x_M = grouped_df['male']
x_F = grouped_df['female'] * -1
# Creating instance of the figure and adding data
fig_2019 = gp.Figure()
fig_2019.add_trace(gp.Bar(y=y_age, x=x_M, name='Male', orientation='h', text=grouped_df['male']))
fig_2019.add_trace(gp.Bar(y=y_age, x=x_F, name='Female', orientation='h', text=grouped_df['female']))
title = "Age distribution of ITESO's 2019 high school student population"
fig_2019.update_layout(title = title,
barmode = 'relative',
bargap = 0.1,
height=600,
yaxis_title='Age',
xaxis = dict(
tickvals = [-2000, -1000, -500, -200, -100, -50, -10, 0, 10, 50, 100, 200, 500, 1000, 2000],
ticktext = ['2K', '1K', '500', '200', '100', '50', '10', '0', '10', '50', '100', '200', '500', '1K', '2K'],
title = 'Number of students')
)
fig_2019 = update_fig_layout(fig_2019, "high_school", title)
# fig_2019.show()
# Aggregating data of enrolled students 2022
plotly_df_filtered_2022 = plotly_df_filtered[plotly_df_filtered['year']=='2022']
grouped_df = plotly_df_filtered_2022.groupby(['age', 'sex'])['students'].sum().unstack(fill_value=0)
grouped_df = grouped_df.reset_index()
y_age = grouped_df['age']
x_M = grouped_df['male']
x_F = grouped_df['female'] * -1
# Creating instance of the figure and adding data
fig_2022 = gp.Figure()
fig_2022.add_trace(gp.Bar(y=y_age, x=x_M, name='Male', orientation='h', text=grouped_df['male']))
fig_2022.add_trace(gp.Bar(y=y_age, x=x_F, name='Female', orientation='h', text=grouped_df['female']))
title = "Age distribution of ITESO's 2022 high school student population"
fig_2022.update_layout(title = title,
barmode = 'relative',
bargap = 0.1,
height=600,
yaxis_title='Age',
xaxis = dict(
tickvals = [-2000, -1000, -500, -200, -100, -50, -10, 0, 10, 50, 100, 200, 500, 1000, 2000],
ticktext = ['2K', '1K', '500', '200', '100', '50', '10', '0', '10', '50', '100', '200', '500', '1K', '2K'],
title = 'Number of students')
)
fig_2022 = update_fig_layout(fig_2022, "high_school", title)
# fig_2022.show()
fig_2019.show()
fig_2022.show()
Consistent with the 2021 change from a nighttime to a daytime shift, we can see a change in student demographics between 2019 and 2022.
In 2019 the students were older; I infer they were mainly adults who went to school after work. The move to daytime classes in 2021 suggests a student population more typical of a traditional school setting, with students of school age.
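The pyramids above mirror the female bars by negating their counts, then relabel the x-axis with unsigned values so both sides read as positive. The hard-coded `tickvals` and `ticktext` lists could be generated instead. This is a minimal sketch under that assumption; `format_count` and `symmetric_ticks` are hypothetical helpers, not part of the notebook or of Plotly.

```python
def format_count(n):
    """Label whole thousands as '1K', '2K', ...; everything else as-is."""
    if n >= 1000 and n % 1000 == 0:
        return f"{n // 1000}K"
    return str(n)


def symmetric_ticks(positive_ticks):
    """Mirror positive tick positions around zero and label them unsigned."""
    tickvals = [-t for t in reversed(positive_ticks)] + [0] + list(positive_ticks)
    ticktext = [format_count(abs(t)) for t in tickvals]
    return tickvals, ticktext


tickvals, ticktext = symmetric_ticks([10, 50, 100, 200, 500, 1000, 2000])
print(tickvals)
print(ticktext)
```

The two lists could then be passed to `xaxis=dict(tickvals=..., ticktext=...)` in both figures, which keeps the 2019 and 2022 axes identical without repeating the literals.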
Now let's explore the student distribution by grades
group_by_dict = {
'PERIODO':'year',
'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict([f"V{i}" for i in range(402, 692 + 1)])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)
plotly_df_filtered = plotly_df
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['institution']=='ITESO']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['type']=='total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['age']=='Total']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['grade']!='total']
title='Number of high school students enrolled at ITESO per year and grade'
fig = px.histogram(
plotly_df_filtered, x="grade", y="students",
title=title,
text_auto='.s',
facet_col='year',
category_orders={"institution": ["average"] + sorted(plotly_df['institution'].unique())},
)
fig.update_layout(yaxis_title='Number of Students')
fig = update_fig_layout(fig, "high_school", title)
fig.show()
This is interesting. By examining the distribution and the previous graphs, we can see that ITESO High School appears to have rebranded itself and has been operating as a brand-new school since 2021.
With the changes implemented in 2021, we already observed a shift in student demographics. This new graph shows a redistribution across grades as well.
Let's explore scholarships
group_by_dict = {
'PERIODO':'year',
'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict([f"V{i}" for i in range(200, 261 + 1)])
plotly_df = get_plotly_df(clean_data_df_ITESO, variables_dict, group_by_dict)
plotly_df_filtered = plotly_df
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['institution']=='ITESO']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['students']>0]
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['sex']=='all']
plotly_df_filtered = plotly_df_filtered[plotly_df_filtered['scholarship_type']!='Total']
variable V230 was not found! variable V258 was not found!
plotly_df_filtered['year'] = pd.to_numeric(plotly_df_filtered['year'])
plotly_df_filtered['year'] = plotly_df_filtered['year'] - 1
plotly_df_filtered
|  | year | institution | students | sex | scholarship_type | detail |
|---|---|---|---|---|---|---|
| 612 | 2018 | ITESO | 46 | all | Beca de la propia institución | NaN |
| 723 | 2021 | ITESO | 177 | all | Beca particular | NaN |
title="High school students enrolled at ITESO with scholarship per year and scholarship type"
plotly_df_filtered = plotly_df_filtered.rename(columns={'scholarship_type': 'Scholarship Types', 'students': 'Number of Students'})
fig = px.bar(
plotly_df_filtered, x='Scholarship Types', y='Number of Students',
title=title,
barmode='group',
facet_col='year',
text_auto='.s'
)
fig = update_fig_layout(fig, "high_school", title)
fig.show()
It's worth mentioning that the information reported about scholarships relates to the previous year. That's why, even though the last reported year was 2022, the most recent scholarship information we have is for 2021.
Having 177 students with scholarships in 2021 is a notable figure: with 192 students enrolled that year, it represents 92.2% of the student body.
year = '2022'
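# Scholarship data reported in `year` refers to the previous year; .iloc[0] extracts the single matching value
students_enrolled_with_scholarship_2021 = int(clean_data_df_ITESO.loc[clean_data_df_ITESO['PERIODO']==year, 'V261'].iloc[0])
students_enrolled_2021 = int(clean_data_df_ITESO.loc[clean_data_df_ITESO['PERIODO']==str(int(year)-1), 'V692'].iloc[0])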
students_enrolled_without_scholarship_2021 = students_enrolled_2021 - students_enrolled_with_scholarship_2021
data = {'status': ['Students with scholarship', 'Students without scholarship'], 'students': [students_enrolled_with_scholarship_2021, students_enrolled_without_scholarship_2021]}
df = pd.DataFrame(data)
title = 'Enrolled high school students at ITESO in 2021 with and without scholarships'
fig = px.pie(
df, values='students', names='status',
title=title
)
fig = update_fig_layout(fig, "high_school", title)
fig.show()
In order to evaluate the purchase of ITESO High School, we need to consider that there is no financial information available, such as tuition value to calculate income or operating costs. This limitation restricts the scope of how we can compare schools within a financial context.
Another method to evaluate school performance and rank them would be by using a standardized score, such as the SATs in the United States. In Mexico, the equivalent is the 'Examen Nacional de Ingreso a la Educación Media Superior' (EXANI-I), but unfortunately, we don't have access to this information either.
The best proxy to evaluate the school could be the actual number of enrolled students, their growth rate, and the number of students with outstanding abilities. Although the assessment of outstanding abilities may be subjective, it is the best metric available.
Before we continue, another important consideration is that high schools operate in a different context than universities or graduate schools. Unlike universities or graduate schools, where students typically relocate to attend campus, high school students usually do not move to a different city. Instead, families may move to a city for various reasons and then seek out the best school they can afford. Therefore, it is not appropriate to compare high schools that are located in different cities.
Let's start by calculating the number of enrolled students and their growth rate.
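The growth rate used in the next cell is the percent change in enrollment from 2021 to 2022. It is undefined when a school had no students in 2021 (pandas returns `inf` or `NaN` for a division by zero), which is why the notebook keeps only schools with positive enrollment in both years. A minimal sketch of the calculation, with `growth_percent` as a hypothetical helper:

```python
def growth_percent(previous, current):
    """Percent change from previous to current, or None when undefined."""
    if previous == 0:
        # No baseline to compare against (for example, a school that opened in 2022).
        return None
    return (current - previous) / previous * 100


iteso_growth = growth_percent(192, 394)  # ITESO's enrollment in 2021 and 2022
no_baseline = growth_percent(0, 120)     # hypothetical school with no 2021 enrollment

print(iteso_growth)
print(no_baseline)
```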
# Get the city where ITESO High School is located
ITESO_city = clean_data_df_ITESO['CV_MUN'].unique()[0]
# Get the data of enrolled students in 2021 and 2022
df_filtered = clean_data_df[(clean_data_df['CV_MUN'] == ITESO_city) & (clean_data_df['PERIODO'].isin(['2021', '2022']))]
plotly_df = df_filtered.groupby(['NOMBRE_INS_PLA', 'C_TURNO', 'PERIODO'])['V692'].sum().unstack(fill_value=0).reset_index()
# Calculate the growth_percent
plotly_df["growth_percent"] = ((plotly_df["2022"] - plotly_df["2021"]) / plotly_df["2021"]) * 100
# Filter out schools without enrolled students in 2021 or 2022, or without positive growth;
# these schools are already performing worse than ITESO, so I won't consider them competitors
plotly_df = plotly_df[(plotly_df['2021'] > 0) & (plotly_df['2022'] > 0) & (plotly_df['growth_percent']> 0)]
plotly_df_melted = plotly_df.melt(id_vars=['NOMBRE_INS_PLA', 'C_TURNO'], var_name='PERIODO', value_name='students')
plotly_df_melted = plotly_df_melted[plotly_df_melted['PERIODO'] == '2022']
plotly_df_melted.head()
|  | NOMBRE_INS_PLA | C_TURNO | PERIODO | students |
|---|---|---|---|---|
| 34 | CECYTE EMSAD 63 TURUNDEO | MATUTINO | 2022 | 116.0 |
| 35 | CECYTE EMSAD NUM.77 "FLOR BATAVIA" | MATUTINO | 2022 | 151.0 |
| 36 | CENTRO DE ESTUDIOS ADMINISTRATIVOS DE OCCIDENTE | VESPERTINO | 2022 | 29.0 |
| 37 | CENTRO DE ESTUDIOS UNIVERSITARIOS VERACRUZ | MATUTINO | 2022 | 249.0 |
| 38 | CENTRO EDUCATIVO MARSELLA | MATUTINO | 2022 | 129.0 |
plotly_df_melted['NOMBRE_INS_PLA'].unique().shape
(26,)
Great, after filtering we are left with 26 high schools: ITESO plus 25 others in Jalisco that share its municipality code (`CV_MUN`) and grew between 2021 and 2022. Let's find out which ones are targeting a similar group of students.
title='High school students enrolled in 2022 per Institutions and shift located in Jalisco'
fig = px.histogram(
plotly_df_melted, x="students", y="NOMBRE_INS_PLA",
title=title,
color='C_TURNO',
text_auto='.s'
)
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(xaxis_title='Number of Students', yaxis_title='Institution')
fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig = update_fig_layout(fig, "high_school", title)
fig.show()
We can see that by the number of enrolled students in 2022, ITESO ranks 8th out of 26. Additionally, there are schools that operate in multiple shifts, such as "Preparatoria 6", which offers "Matutino", "Vespertino", and "Discontinuo". This allows it to enroll more students.
Once again, we'll filter the data to match ITESO's criteria, so we'll be filtering schools that offer the "Matutino" shift like ITESO.
As mentioned before, now that we have the number of enrolled students, I'll incorporate the number of students with outstanding abilities. This will help us classify schools that are similar and provide a better understanding of the schools. For this, I'll be using the K-Means clustering technique.
ITESO_city = clean_data_df_ITESO['CV_MUN'].unique()[0]
df_filtered = clean_data_df[clean_data_df['CV_MUN']==ITESO_city]
df_filtered = df_filtered[df_filtered['PERIODO'].isin(['2021', '2022'])]
df_filtered = df_filtered[df_filtered['C_TURNO']=='MATUTINO']
plotly_df=df_filtered.groupby(['CCT_INS_PLA', 'NOMBRE_INS_PLA', 'PERIODO'])['V692'].sum().unstack(fill_value=0).reset_index()
plotly_df["growth_percent"] = ((plotly_df["2022"] - plotly_df["2021"]) / plotly_df["2021"]) * 100
plotly_df = plotly_df[(plotly_df['2021'] > 0) & (plotly_df['2022'] > 0) & (plotly_df['growth_percent']> 0)]
plotly_df=plotly_df.reset_index(drop=True)
# plotly_df
temp = clean_data_df[
(clean_data_df['CCT_INS_PLA'].isin(plotly_df['CCT_INS_PLA'])) &
(clean_data_df['PERIODO']=='2022') &
(clean_data_df['V692']>0)
]
temp2=temp.groupby(['CCT_INS_PLA'])[['V942', 'V945', 'V948', 'V951', 'V954']].sum().reset_index()
plotly_df = pd.merge(plotly_df, temp2, on='CCT_INS_PLA', how='left')
plotly_df.head()
|  | CCT_INS_PLA | NOMBRE_INS_PLA | 2021 | 2022 | growth_percent | V942 | V945 | V948 | V951 | V954 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14MMS0078X | COLEGIO DE BACHILLERES 5 | 566 | 589 | 4.06 | 28 | 54 | 54 | 64 | 58 |
| 1 | 14MMS0081K | COLEGIO DE BACHILLERES 8 | 211 | 230 | 9.00 | 0 | 0 | 0 | 0 | 0 |
| 2 | 14MMS0278V | ESCUELA PREPARATORIA SANTA MARIA TEQUEPEXPAN | 246 | 259 | 5.28 | 0 | 0 | 0 | 0 | 0 |
| 3 | 14MMS0326O | CENTRO DE ESTUDIOS UNIVERSITARIOS VERACRUZ | 175 | 249 | 42.29 | 0 | 0 | 0 | 0 | 0 |
| 4 | 14MMS0385D | INSTITUTO LIDERES DEL SIGLO | 163 | 169 | 3.68 | 0 | 0 | 0 | 0 | 0 |
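K-Means groups points by Euclidean distance, so a feature measured in hundreds of students can outweigh one measured in single-digit percentages. One common option, which this notebook does not apply, is to standardize each feature before clustering. The sketch below illustrates the idea with the standard library on two columns from the table above; `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev

# 2022 enrollment and growth percent for the five schools shown above.
enrollment_2022 = [589, 230, 259, 249, 169]
growth_percent = [4.06, 9.00, 5.28, 42.29, 3.68]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


scaled_enrollment = standardize(enrollment_2022)
scaled_growth = standardize(growth_percent)

for raw, scaled in zip(enrollment_2022, scaled_enrollment):
    print(f"{raw:>4} -> {scaled:+.2f}")
```

In the notebook's setting, the equivalent step would be to pass `enrollment_data` through scikit-learn's `StandardScaler` before calling `kmeans.fit`. Whether scaling changes the resulting clusters here has not been tested.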
# I'll use the number of enrolled students in 2022, the growth percent and the
# number of students with outstanding abilities to fit the k-means model
k = 5
kmeans = KMeans(n_clusters=k, random_state=0)
enrollment_data = plotly_df[["2022", 'growth_percent', 'V942', 'V945', 'V948', 'V951', 'V954']]
# Fit the model to the data
kmeans.fit(enrollment_data)
# Assign the cluster to each school
plotly_df["cluster"] = kmeans.labels_
plotly_df = plotly_df.sort_values(by='cluster', ascending=True)
cluster_colors_map = {}
for i in range(k):
cluster_colors_map.update({i:str(i)})
colors = [cluster_colors_map.get(label, px.colors.qualitative.Plotly[0]) for label in plotly_df["cluster"]]
plotly_df['label'] = plotly_df['NOMBRE_INS_PLA'].apply(lambda x: x if x == 'ITESO' else '')
title= "School Enrollment Distribution (2021 vs. 2022) with Growth"
fig = px.scatter(
plotly_df, x="2022", y="growth_percent",
title=title,
size="growth_percent", size_max=50,
color=colors, opacity=0.7,
hover_name='NOMBRE_INS_PLA', hover_data={'growth_percent':':.2f','cluster':True},
labels={"2022": "Students in 2022", "growth_percent":"Growth percent", "cluster":"Cluster"},
text='label'
)
fig.update_layout(legend_title_text="Cluster" if colors else None)
fig.update_layout(xaxis_title='Number of Students in 2022', yaxis_title='Growth percentage')
fig = update_fig_layout(fig, "high_school", title)
fig.show()
We can see that ITESO was classified into a cluster along with three other schools. However, ITESO's growth is greater than that of the other schools. Let's plot the same data with the new cluster information.
title='High school students enrolled in 2022 per Institutions (clustered) located in Jalisco'
fig = px.histogram(
plotly_df, x="2022", y="NOMBRE_INS_PLA",
title=title,
text_auto='.s',
color='cluster'
)
fig.update_yaxes(categoryorder="total ascending")
fig.update_traces(textfont_size=12, textangle=0, cliponaxis=False)
fig.update_layout(xaxis_title='Number of Students', yaxis_title='Institutions')
fig = update_fig_layout(fig, "high_school", title)
fig.show()
With this new information, we see that ITESO has three close competitors in the short term. Looking further ahead, it would also compete with schools that already enroll more students.
One last graph I would like to see shows the number of students with outstanding abilities by type of ability.
ITESO_city = clean_data_df_ITESO['CV_MUN'].unique()[0]
df_filtered = clean_data_df[clean_data_df['CV_MUN']==ITESO_city]
df_filtered = df_filtered[df_filtered['PERIODO'].isin(['2021', '2022'])]
df_filtered = df_filtered[df_filtered['C_TURNO']=='MATUTINO']
plotly_df=df_filtered.groupby(['CCT_INS_PLA', 'NOMBRE_INS_PLA', 'PERIODO'])['V692'].sum().unstack(fill_value=0).reset_index()
plotly_df["growth_percent"] = ((plotly_df["2022"] - plotly_df["2021"]) / plotly_df["2021"]) * 100
plotly_df = plotly_df[(plotly_df['2021'] > 0) & (plotly_df['2022'] > 0) & (plotly_df['growth_percent']> 0)]
plotly_df=plotly_df.reset_index(drop=True)
plotly_df = clean_data_df[
(clean_data_df['CCT_INS_PLA'].isin(plotly_df['CCT_INS_PLA'])) &
(clean_data_df['PERIODO']=='2022') &
(clean_data_df['V692']>0)
]
group_by_dict = {
'PERIODO':'year',
'NOMBRE_INS_PLA':'institution',
}
variables_dict = build_variables_dict(['V942', 'V945', 'V948', 'V951', 'V954'])
plotly_df = get_plotly_df(plotly_df, variables_dict, group_by_dict)
plotly_df = plotly_df[plotly_df['students']>0]
# plotly_df
title='Number of students with outstanding abilities in 2022 by institution and type of ability'
fig = px.bar(
plotly_df, x='students', y='institution',
title=title,
color='aptitude',
text_auto='.s',
)
fig.update_layout(xaxis_title='Number of Students',yaxis_title='Institution')  # the y-axis holds institutions; aptitude is the color
fig = update_fig_layout(fig, "high_school", title)
fig.show()
This graph shows something interesting. Even though ITESO is a small, new school, the data suggests that its curriculum or culture is diverse and aims to develop multiple aptitudes in its students. This is unlike other institutions, where the only aptitude reported is intellectual. I'm not saying that intellectual aptitude is unimportant, but ITESO and "Colegio de Bachilleres 5" seem to be the only institutions whose students develop multiple types of abilities.
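One way to make "multiple types of abilities" measurable is to count how many aptitude variables have at least one student per school. The sketch below does this for three rows of the merged table shown earlier. It assumes that each of the five variables corresponds to one aptitude type, which the notebook does not state explicitly, and `aptitude_diversity` is a hypothetical helper.

```python
# Counts of students per aptitude variable, copied from the merged table shown earlier.
aptitude_columns = ["V942", "V945", "V948", "V951", "V954"]
schools = {
    "COLEGIO DE BACHILLERES 5": [28, 54, 54, 64, 58],
    "COLEGIO DE BACHILLERES 8": [0, 0, 0, 0, 0],
    "CENTRO DE ESTUDIOS UNIVERSITARIOS VERACRUZ": [0, 0, 0, 0, 0],
}


def aptitude_diversity(counts):
    """Number of aptitude variables with at least one student."""
    return sum(1 for count in counts if count > 0)


diversity = {name: aptitude_diversity(counts) for name, counts in schools.items()}
for name, n_types in diversity.items():
    print(f"{name}: {n_types} of {len(aptitude_columns)} aptitude variables reported")
```

A count like this could be added as a column and used to rank schools, although it treats every aptitude type as equally important.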